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Abstract —Tucker decomposition is the cornerstone of modern machine learning on tensorial data analysis, which have attracted 
considerable attention for multiway feature extraction, compressive sensing, and tensor completion. The most challenging problem is 
related to determination of model complexity (i.e., multilinear rank), especially when noise and missing data are present. In addition, 
existing methods cannot take into account uncertainty information of latent factors, resulting in low generalization performance. To 
address these issues, we present a class of probabilistic generative Tucker models for tensor decomposition and completion with 
structural sparsity over multilinear latent space. To exploit structural sparse modeling, we introduce two group sparsity inducing priors 
by hierarchial representation of Laplace and Student-t distributions, which facilitates fully posterior inference. For model learning, we 
derived variational Bayesian inferences over all model (hyper)parameters, and developed efficient and scalable algorithms based on 
multilinear operations. Our methods can automatically adapt model complexity and infer an optimal multilinear rank by the principle of 
maximum lower bound of model evidence. Experimental results and comparisons on synthetic, chemometrics and neuroimaging data 
demonstrate remarkable performance of our models for recovering ground-truth of multilinear rank and missing entries. 

Index Terms —Tensor decomposition, tensor completion, Tucker decomposition, multilinear rank, structural sparsity, MRI completion 
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1 Introduction 

T ENSOR decomposition aims to seek a multilinear la¬ 
tent representation of multiway arrays, which is a 
fundamental technique for modern machine learning on 
tensorial data mining. In contrast to matrix factorization, 
tensor decomposition can effectively capture higher order 
structural properties by modeling multilinear interactions 
among a group of low-dimensional latent factors, leading 
to significant advantages for multiway dimension reduc¬ 
tion 11 j, multiway regression/classification j2j, ( 3 ), and com¬ 
pressive sensing G) or completion of structured data [5[. 
In particular. Tucker |6) and Candecomp/Parafac (CP) |7) 
decompositions have attracted considerable interest in var¬ 
ious fields, such as chemometrics, computer vision, social 
network analysis, and brain imaging m @ d 

Tucker decomposition represents an TVth-order tensor 
by multilinear operations between N factor matrices and 
a core tensor. If we view factor matrices as basis of N latent 
space, then core tensor represents a multilinear coefficient 
for the whole tensor. An alternative perspective is that 
core tensor is viewed as basis of a latent manifold space, 
then factor matrices represent low-dimensional coefficients 
corresponding to N different profiles for each data point. 
Tucker decomposition can exactly represent an arbitrary 
random tensor. In contrast, CP decomposition enforces a 
strict structure assumption by a super-diagonal core tensor. 
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which is suitable for highly structured data but with limited 
generalization ability for various types of data. Tensor de¬ 
composition becomes interest only when underlying infor¬ 
mation has an intrinsic low-rank property, leading to a com¬ 
pact representation in latent space. Hence, one fundamental 
problem is separating the information and noise space by 
model selection, i.e., determination of tensor rank, which is 
challenging when noise variance is comparable to that of 
information. Note that tensor rank differs from matrix rank 
in many mathematic properties 111 J, fl2|, |13|. Numerous 
studies proposed tensor decomposition methods by either 
optimization techniques or probabilistic model learning |8|, 
191, 114), |151, 1161, (17). However, tensor rank is always re¬ 
quired to be specified manually. In addition, general model 
selection methods need to perform tensor decomposition for 
each predefined model, which is inapplicable for Tucker 
model due to that the number of possibilities increases 
exponentially with the order of tensors. Some preliminary 
studies on automatic model determination were exploited 
in (18), (T9) which however, may prone to overffing and fail 
to recover true tensor rank in the case of highly noisy data, 
due to point estimation of model parameters. 

Tensor completion can be formulated as tensor decom¬ 
position with missing values, which seems fairly straightfor¬ 
ward but has several essential differences. Tensor decompo¬ 
sition focusing on data representation is referred to unsu¬ 
pervised technique, whereas tensor completion focusing on 
predictive ability is referred to semi-supervised technique, 
which aims to predict tensor elements from their TV-tuple 
indices. Another important issue is that the optimal tensor 
rank for tensor completion is related to missing ratio and 
may differ from the underlying true rank that is always 
optimum for tensor decomposition. Hence, the key problem 
of tensor completion is to choose an appropriate tensor rank 
(i.e., model complexity) that can obtain optimal generaliza- 
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tion performance, whereas this problem is more challenging 
as compared to tensor decomposition for a fully observed 
tensor. More specifically, existing model selections including 
cross-validation are not applicable due to correlations be¬ 
tween the optimal tensor rank and number of observed data 
points. Although numerous tensor decomposition methods 
for partially observed tensor were developed in |20|, |2l) , 
(22) , (23) , (24) , (25) , [261, the most challenging task of speci¬ 
fying tensor rank was conveyed to user, and thus was done 
mostly based on an implicit assumption that missing data 
is known. Several studies have considered automatic rank 
determination for CP decomposition (5), (27) . In contrast 
to CP rank, multilinear rank has more degree of freedom. 
An alternative framework for tensor completion, based on 
a low-rank assumption, was developed by minimizing the 
nuclear norm of approximate tensor, which corresponds to 
a convex relaxation of rank minimization problem (28) . Al¬ 
though this technique successfully avoids specifying tensor 
rank manually, several tuning parameters are required and 
sensitive to missing ratio. Hence, it essentially transformed 
the model selection problem to a parameter selection prob¬ 
lem. The nuclear norm based tensor completion was shown 
to be attractive in recent years |29|, (30), |31|, (32), |33|, (34) . 
Many variants by imposing additional constraints were also 
exploited in |35|, (36) which have shown advantages for 
some specific type of data. However, the best performance 
was mostly obtained by carefully tuning parameters based 
on implicit assumption that missing data is known. Another 
issue is that the definition of nuclear norm of tensor cor¬ 
responds to a (weighted) summation of mode-n rank R n 
denoting the dimension of latent factor matrices. However, 
the dimension of core tensor Yl n R n represents the model 
complexity of whole tensor as described previously. As a 
result, another possible framework can be introduced by 
optimization of logarithm transformed nuclear norm. 

In this paper, we present a class of generative models for 
Tucker representation of a complete or incomplete tensor 
corrupted by noise, which can automatically adapt model 
complexity and infer an optimal multilinear rank solely 
from observed data. To achieve automatic model determi¬ 
nation, we investigate structural sparse modeling through 
formulating Laplace as well as Student-t distributions in a 
hierarchial representation to facilitate full posterior infer¬ 
ence, which therefore can be further extended to enforce 
group sparsity over factor matrices and structural sparsity 
over core tensor. For model learning, we derive the full 
posterior inference under a variational Bayesian framework, 
including remarkable inference for the non-conjugate hi¬ 
erarchical Laplace prior. Finally, all model parameters can 
be inferred as well as predictive distribution over missing 
data without needing of tuning parameters. In addition, we 
introduce several Theorems based on multilinear operations 
to improve computational efficiency and scalability. 

The rest of this paper is organized as follows: Section |2| 
presents notations and multilinear operations. In Section pT 
we present group sparse modeling by hierarchical sparsity 
inducing priors. Section [4] presents Bayesian Tucker model 
for tensor decomposition, while Section [5] presents Bayesian 
Tucker model for tensor completion. The algorithm related 
issues are discussed in Section [6] Section [7] shows experi¬ 
mental results followed by conclusion in Section [8] 


2 Preliminaries and Notations 

Let Af,X,x denote a tensor, matrix, vector respectively. 
Given an TVth-order tensor X E R Jl x^x-xi^ its 
(zi,..., i;v)th entry is denoted by Xi x ...i N , where i n = 
1,..., J n , n = 1,..., N. The standard Tucker decomposition 
is defined by 

T = 6xi x 2 U (2) x • • • x tv U^. (1) 

E are a set of mode-n factor ma¬ 

trices, Q E R Rl xR z x --- xR n denotes the core tensor and 
(i?i,..., Rn) denote the dimensions of mode-n latent 
space, respectively. The overall model complexity can be 
represented by n n or R n , whose minimum as¬ 
sociated values {R n }n=i is termed as multilinear rank of 
tensor X |9|. For a specific LJ( n ), we denote its row vec¬ 
tors by {u^|i n = 1 ,...,/ n } and its column vectors by 
{u -rl\r n = 1 

Definition 2.1. Let {U 1 '”- 1 £ M In y R " } l denote a set of 
matrices, the sequential Kronecker products in a reversed 
order is defined and denoted by 

(g)U w = U (JV) ®U (JV_1) <g>---<g>U (1) . 

n 

0 u (fe) = U (JV) ® ® u (ra+1) ® u (n_1) ® • • • <g> u (1) . 

k^n 

The symbol (8) denotes Kronecker product. 0 n is a 
matrix of size ([[ n I n x n n R n)- 

The Tucker decomposition (T) can be also represented by 
using matrix, vector, or element-wise forms, given by 

X (n) =U^G (n) ((g)uW T Y 
\k^n J 

vec(Af) = ^(8)U (n) 'jvec(0), (2) 

' n ' 

= ((g^W). 

' n ' 

It should be noted that the multilinear operation is signifi¬ 
cantly efficient for computation. For example, if we compute 
0 n U( n ) firstly and then multiply it with vec (G), both the 
computation and memory complexity is O (Yl n InRn)- I n 
contrast, if we apply a sequence of multilinear operations 
(•) x n LJ( n ) without explicitly computing Xj( n ), the 
computational complexity is O (min n (R n ) Yl n I n ) while the 
memory cost is 0(Y\ n I n ). In this paper, we use notation 
0 n (-) frequently for clarity, however, the implementation 
can be performed by using multilinear operations. 

3 Hierarchical Group Sparsity Priors 

The sparsity inducing priors are considerably important 
and powerful for many machine learning models. The most 
popular ones are Laplace, Student-t, and Spike and slab 
priors. However, these priors are often not conjugate with 
the likelihood distribution, which leads to difficulties for 
fully Bayesian inference. In contrast, another popular spar¬ 
sity inducting prior is automatic relevance determination 
(ARD), which has been widely applied in many powerful 
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methods such as relevance vector machine and Bayesian 
principle component analysis (37) , (38) , |39|. The advantages 
of ARD prior lie in its conjugacy resulted from a hierarchical 
structure. Note that ARD prior is essentially a hierarchical 
Student-t distribution since its marginal distribution is 

oo 

V(®|0, A“ 1 )Ga(A|a, b) dX = T(x |0, ab~\ 2a). (3) 

Ga(x\a, b) = f^jX a ~ 1 e~ bx denotes a Gamma distribution 
and T(x) denotes a Student-t distribution. One can show 
that when a noninformative prior is specified by a = b 0, 
then p{x) oc l/\x\, which indicates the sparsity inducing 
property of a hierarchial Student-t prior. 

A straightforward question is that can we represent 
Laplace prior by the hierarchical distributions which are 
conjugate? As shown in (40), |4l|, a hierarchical structure 
of Gaussian and Exponential distributions yields a Laplace 
marginal distribution, whereas they are not conjugate pri¬ 
ors. In this paper, we present another hierarchial Laplace 
distribution by employing an Inverse Gamma distribution 
IG(x\a,b) = x~ a ~ 1 e~ bx . By a specific setting of 

/G(l, T) = 'l x ~ 2 e~i x 1 , we can show that the marginal 
distribution is 

r°° , /v 1 

/ AT(x\0, A _1 )/G(A|1, j-) dX = Laplace (x|0, —). (4) 

Jo 2 v7 

Note that 7 govern the degree of sparsity, for example, 
if 7 = 1, then p{x) oc e~\ x K To avoid parameter selec¬ 
tion manually, we can place a noninformative prior over 
7 and employ Bayesian inference for posterior estimation. 
Although inverse Gamma is a non-conjugate prior, we can 
infer its variational posterior represented by a more gener¬ 
alized distribution. 

Let x = {xi,..., xr] denote a set of random variables, 
the hierarchical sparsity inducing priors can be specified as 
Vr = 1,... ,R, 

Student-t: x r ~ A/*(0, A” 1 ), A r ^Ga(a,6), (5) 


Laplace: x r ~ J\f(0, X r 1 ), A r ~ JG(1, ^). (6) 

Hence, the marginal distributions of x can be found by 

{£}, (Q as p(x) = [J^ =1 T(x r jO,ab~ 1 ,2a) and p(x) = 
nLi Laplace(x r |0, ^=). 

Now we consider to extend the above hierarchical priors 
for group sparse modeling. Let X — {xi,... , x#} denote 
R groups of random variables where x r E R Ir denote rth 
group that contains I r random variables. Note that Ii = 
I2 • • • = Ir is not necessary. The group sparse modeling is 
to enforce sparsity on groups in contrast to the individual 
random variables, which effectively take into account the 
clustering properties of relevant variables. To employ the 
hierarchial sparsity priors for group sparse model, it can be 
specified as Vr = 1,..., R, 


Student-t: 

x r ~ V(0, A r 1 I/ r ), 

A r ^ Ga(a, 6), 

(7) 

Laplace: 

x r ~.V(o,A;r 1 ij I .), 

A r ~7G(l,^). 

(8) 


Therefore, the marginal distributions of X can be de¬ 
rived as p(X) = Ur=iT^ r \0,ab-\2a) and p(X) = 


Ur=iLaplace(x r \O f -^), where T(x), Laplace(x ) denote 
a multivariate Student-t distribution and a multivariate 
Laplace distribution respectively (42) . One can show that 
when a = b 0, then Vr, p(x r ) oc (l/||x r 11 2 )- When 
7 — 1 , then Vr, p(x r ) oc e — ll x ^II 2 . 

In the following sections, we employ both hierarchial 
Student-t and hierarchial Laplace priors to impose group 
sparsity over multi-mode latent factors, yielding the mini¬ 
mum latent dimensions, for tensor decomposition models. 


4 Bayesian Tucker Model for Tensor De¬ 
composition 

4.1 Model specification 

We first consider the Bayesian Tucker model for an TVth- 
order tensor y E R 7 i x '" x/jv that is fully observed. We 
assume that y is a measurement of the latent tensor X 
corrupted by i.i.d. Gaussian noises, i.e., y = X +e, where X 
is generated exactly by the Tucker representation as shown 
in {TJ. Therefore, the observation model can be specified by 
a vectorized form. 


vec(y)|{u(")},g,r - Vn0U( n Avec(g), r-T) 


(9) 

Our objective is to infer the model parameters, as well as 
model complexity, automatically and solely from given data. 
To this end, we propose the hierarchical prior distributions 
over all model (hyper)parameters. For noise precision r, a 
Gamma prior can be simply placed with appropriate hyper¬ 
parameters, yielding an noninformative prior distribution. 

To modeling tensor data by an appropriate model com¬ 
plexity, it is important to design the flexible prior dis¬ 
tributions over the group of factor matrices {LJ( n )},n = 
1,..., N, and the core tensor Q, whose complexity can be 
adapted to a specific given tensor. Since the model complex¬ 
ity of Tucker decompositions depends on the dimensions of 
Q, denoted by (i?i, i? 2 , •.. , Rn\ while R ni n = 1,..., N 
also corresponds to the number of columns in model-n 
factor matrix and thus represents the dimensions of 

mode-n latent space. The minimal number of {R n }^ =1 i.e., 
multilinear rank (5) usually need to be given in advance. 
However, due to the presence of noise, the optimal selection 
of multlinear rank is quite challenging. Although some 
model selection criterions can be applied, the accuracy sig¬ 
nificantly depends on the decomposition algorithms, result¬ 
ing in less stability and high computational cost. Therefore, 
we seek an elegant automatic model selection, which can 
not only infer the multilinear rank, but also effectively avoid 
overfitting. To achieve this, we employ the proposed group 
sparsity priors over factor matrices. More specifically, each 
UW is govern by hyperparameters E R Rn , where 
controls the precision related to r n group (i.e., r n th column). 
Due to the group sparsity inducing property, the dimensions 
of latent space will be enforced to be minimal. On the other 
hand, the core tensor Q also needs to be as sparse as possi¬ 
ble. However, if we straightforwardly place an independent 
sparsity prior, the interactions between Q and {U^ n ^} can¬ 
not be modeled, which may lead to inaccurate estimation 
of multilinear rank. As G ri ,...,r N can be considered as the 


ZHAO et a!:. BAYESIAN SPARSE TUCKER MODELS FOR DIMENSION REDUCTION AND TENSOR COMPLETION 


4 


scalar coefficient of the rank-one tensor 0 • • • 0 
that involves r n th group from each of U^, respectively. 
Taken into account the fact that if 3n, 3r n ,u^J = 0, then 
G r r N should be also enforced to be zero, the precision 
parameter for G rii ...,r N can be specified as the product of 
precisions over which is represented by 

g ri ... rN \{\^},(3 ~ v(o, (^riA^)- 1 ) , (10) 

where (3 is a scale parameter related to the magnitude of 
G , on which a hyperprior can be placed. The hyperprior for 
play a key role for different sparsity inducing priors. 
We propose two hierarchial priors including Student-t and 
Laplace for group sparsity. Let = diag(A^), we can 
finally specify the hierarchial model priors as 


vec (G) | |A (n - ) | ,/3 

P 

u| n) |A {n) 

L n I 

Student-t: A^ 

Laplace: A^ 

7 

The prior for the core tensor G is written as a tensor- 
variate Gaussian distribution. The observation model in (9) 
and the hierarchial priors in (TT) are integrated, which is 
termed as Bayesian Tucker Decomposition (BTD) model. 
Note that there are two proposed hierarchial sparsity priors 
in BTD model, which are thus denoted respectively by BTD- 
T (Student-t priors) and BTD-L (Laplace priors). 

For simplicity, all unknown (hyper)parameters in 
BTD model are collected and denoted by 0 = 

{0,U( 1 ),...,U( jv ),A (1) ,...,A (jv) ,t, / 5,7}. Thus the joint 
distribution of BTD model is written as 

p(y,e) =p(y\{u^},g,T) n?>(u (n) l A(n) ) 

n 

xp(g\{A^},/3) nP(A (n) \l)p(l)p(fi)p{ T )■ (12) 

n 

In general, maximum a posterior (MAP) of 0 can be esti¬ 
mated by optimizations of logarithm joint distribution w.r.t. 
each parameters alternately. However, due to the property 
of point estimation, MAP is still prone to overfitting. In con¬ 
trast, we aim to infer the posterior distributions of 0 under 
a fully Bayesian treatment, which is p(Q\y) = j p(o y)dO • 

4.2 Model inference 

The model learning can be performed by the approximate 
Bayesian inferences when the posterior distributions are 
analytically intractable. Since the variational Bayesian in¬ 
ference (43) is more efficient and scalable as compared to 
sampling based inference methods, we employ VB tech¬ 
nique to learn both BTD-T and BTD-L models. It should be 
noted that the hierarchial Laplace priors are not conjugate. 


~ Ga(aQ, bl), 



~ Ga(abfi), 

~ V(0, Vn,Vhi, 

~ Ga (do, bo), Vn,Vr„, 

~ IG{ 1, |), Vn,Vr„, 

- Ga(al bl). 


resulting in a challenging problem to be addressed. In this 
section, we present the main solutions for model inference 
while the detailed derivations and proofs are provided in 
the Appendix. 

VB inference aims to seek an optimal q(Q) to ap¬ 
proximate the true posterior distribution in the sense 

of mmKL(q(G)\\p(e\y)). Since KL(q(e)\\p(e\y)) = 
ln p(y) — L(q), the optimum of q(Q) can be achieved by 
maximization of lower bound L(q) that can be computed 
explicitly. To achieve this, we assume that the variational 
approximation posteriors can be factorized as 

q(0) = q(G)q(P) g(U (n) ) g(A (n) )g( 7 )g(r). (13) 

n n 


Then, the optimized form of jth parameters based on 

max g(e .) L(q) is given by 


qj(@j) oc exp {E 9( e\ ej) [ln p(y, 0)]} . (14) 

Eg(©\© j .)[-] denotes an expectation w.r.t. q distribution over 
all variables in 0 except ©j. In the following, we use E[-] to 
denote the expectation w.r.t. g(0) for simplicity. 

As can be derived, the variational posterior distribution 
over the core tensor G is 

q(G) = V (vec(<7)|vec((J), Eg) , (15) 

where the posterior parameters can be updated by 

vec(g) = E[r] S g ^(g) E [u (n)T ] ^ vec (J7), (16) 

Eg = | E[/3] 0 E [A (n) ] + E[r] 0 E [u (n)T U (n) ] | . (17) 

t n n ) 

Most of expectation terms in ( [T6) >, (T7| ) are the functions 
linearly related to the corresponding Qj, which can thus be 
easily evaluated from their posterior q(@j). For example, 
E[LJ( n ) T ] can be evaluated according to g(LJ( n )) shown 
in (l8| , as can be similarly computed for E[r], E[/?], and 
E[AV^]. It should be noted that the expectation involv¬ 
ing a quadratic term can be evaluated explicitly by using 
E[U( n ) T U( n )] = E[U( n ) T ]E[U( n )] + I n ^ n \ which requires 
the posterior parameters given in (lSf . 

The computational complexity for posterior update of 
G is O (n n Rn + Yin InRn) that polynomially scales with 
model complexity denoted by n n Rn and linearly scales 
with data size denoted by n n In- I n general, it is dominated 
by O (E[ n Rn)' w hich is related to the matrix inverse. The 
memory cost is O (II n Rn + Yl n I n R n ), dominated by 
and 0 n (•). It should be noted that multilinear operations 
y x! E[U^ T ] x • • • x N E[U (iV)T ] can be performed without 
explicitly computing 0 n E[U^ n ^ T ], thus reducing the mem¬ 
ory cost to O (rin R l + Yl n Q and also reducing computa¬ 
tion complexity to 0(Yl n Rn + EL In)- To further improve 
computation and memory efficiency, we introduce several 
important Theorems as follows. 
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Theorem 4.1. Let {5]^} be a set of diagonalizable matrices , 
and ci , C 2 denote arbitrary scalars. lf\/n = 1,..., N, the spectral 
decompositions are represented by V^ n ^ T , then 


Ci I + C 2 (y\) 5] 


(n) 


)V (n) I ( ciI + c 2 CR)D (n) 


ci 6?)A (u) +C 2 (V)S 


(n) 


v (ri) 2 A/ ( n ) 


ClI + C 2 (X) D 


(n) 


Proof See Appendix for the detailed proof. 


\yO) T 2 


□ 


Based on Theorem |4.1| and |4.2| can be factorized as 
the product of sequential Kronecker products, which leads 
to that matrix inverse operations can be performed by indi¬ 
vidual eigenvalue decompositions on N small matrices, and 
inverse operations only on a diagonal matrix. Therefore, the 
computational complexity for inference of Q can be signif¬ 
icantly reduced to O (J2 n + Tl n In) while the memory 
cost can be significantly reduced to 0(^f n R 2 + Yl n I n ), if 
we save by a format of sequential Kronecker products. 

As can be derived, the variational posterior distribution 
over the factor matrices {tE 72 )} is represented by 


71 / \ 

q( U (n) )= n^K ) KT^)),n = l,...,7V, (18) 

in — 1 

where the posterior parameters can be updated by 

u (n) = E[r] Y(„) ( (g)E [u (fc) ] ] E [cf n) ] '£ (u) , (19) 

t (n) = J E[A (n) ] + E[t]E 


T (fc)T TT (fc) 


G(n) ( Q$)U WJ V 

k^n 


G 


(n) 


( 20 ) 

In J20l, the most complex expectation term related to 
multilinear operations can be computed by 


vec < E 


G 


(n) 


)U (fc)T U (fc) Gf n) 


k^n 

= E [G (n) <g> G (n) ] vec 



Jj(k)T jj(k) 


|) 


( 21 ) 


Thus, each posterior expectation term can be evaluated 
easily according to the corresponding posterior distributions 
q(G) and {^(U^)}, k « 1,..., TV, k n. 


r(n)T 


Proof See Appendix for the detailed proof. □ 

Theorem 4.2. Let {} be a set of diagonal matrices, } 
be a set of diagonalizable matrices , and c \, C 2 be scalars. If 
\/n = 1,..., N, the spectral decompositions are represented by 

^Y(n) _ 2 ^,(n) 7 Y(n) _ 2 _ y(n)j)(n)y(n)T^ 


Taken into account the computation and memory ef¬ 
ficiency, multilinear operations and sequential Kronecker 
products format must be employed to avoid explicitly 
computation of sequential Kronecker products. It should 
be noted that (2l) cannot be factorized into operations 
on individual kronecker terms because of E [G( n ) 0 G( n )]. 
To reduce the memory cost, we may approximate it by 
E [G( n )] ®E [G( n )]. Therefore, the computational complex¬ 
ity for inference of TJ( n ) can be improved to 0(Rf l + n n In) 
while the memory cost is O (n n Rn + EL In)- 

The variational posterior distribution over (3 can be 
derived to be a Gamma distribution due to its conjugate 
prior, which is denoted by q(/3) = Ga^a^^b^) where the 
posterior parameters can be updated by 


J _ 


X M 


- a o +9 n Rn 


b M = b o + [vec(0 2 ) T 


E |a ( " } 


( 22 ) 


Inj22l),E vec(^ 2 ) T = vec(E[C/] 2 ) T + diag(£G) T should be 


applied for rigorous inference, whereas an alternative ap¬ 
proximation is E |vec(C/ 2 ) T j = vec(E[C/] 2 ) T for efficiency. 

{E[A^]} can be easily evaluated according to {g(A^)} 
described in the following paragraphs. The computational 
complexity in (221 is 0([\ n R n ). 

The inference of hyperparameters {A^} plays a key 
role for automatic model selection (i.e., determination of 
multilinear rank). In BTD models, as we proposed two hier¬ 
archical sparsity priors, resulting in two different posterior 
distributions for {A^}. 

BTD-T model using hierarchial Student-t priors. As can be 
derived, the variation posterior distribution over {A^} 
is i.i.d. Gamma distributions due to the conjugate priors, 
which is Vn = 1,..., N, 

Rn 

q( A (n) ) = ]J Ga(A£0|a£>,6£>), (23) 

r n = 1 

where the posterior parameters can be updated by 


= a* + l 




\ k^n 

&<"> = b o+l^ |u^ )T u ( r n) 
+ ^E[/3]E [vec(0 2 . r 


(24) 


I E 


k^n 


[a«] 


According to { q{\J (n , ) } described in (181, we obtain that 

E[u.W T uW]»/,,(* (n) ) +u(; )T u(;>. (25) 

This represents the posterior expectation of squared 1/2- 
norm of rth component in mode-n factors, which also takes 
into account the uncertainty information. E |vec(£/ 2 . rn ...) T J 
represents the posterior expectation of squared Z^-norm 
of r n slice of core tensor Q. Therefore, an intuitive inter¬ 
pretation of automatic model selection is that the smaller 


of ||u^l|^ and \\Q... rn ... ||^ leads to larger E[AJC ; 1 and 


d n )i 


updated prior for p(u^A^) as well as p(Q... Vri ... |A^), 
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which in turn enforces more strongly and Q to 
be zero. After several iterations, the unnecessary columns in 
factor matrices and unnecessary slices in core tensor can be 
reduced to exact zero (i.e., smaller than machine precision). 
The computational complexity for inference of g(A^) is 
0(Yl n Rn + InRn)' Given updated parameters for q( A^), 
we can evaluate the posterior expectations by E[A^] = 

[4 n) M n) , • • J and E[A (n) ] = diag(E[A (n) ]). 

BTD-L model using nierarchial Laplace priors. Since the 
hierarchical Laplace prior is not conjugate, which leads to 
much difficulties for model inference. To solve this prob¬ 
lem, we employ a generalized inverse Gaussian distribu¬ 
tion, denoted by GIG(x\h,a,b), which includes Gamma, 
inverse Gamma, and inverse Gaussian distribution as spe¬ 
cial cases by an appropriate setting of hyperparameters. 
For example, one can show that Ga(a,b) = G/G(a, 2&, 0) 
and JG(a, b ) = G/G(—a, 0, 2b). For BTD-L mode, we keep 
using Ar^ ^ IG( 1 , as the hyper-prior, however the 
variational posterior q( cannot be represented as an 
inverse Gamma distribution. 

As can be derived, the variation posterior distribution 
over {A^} can be represented as i.i.d. GIG distributions, 
which is Vn = 1 ,..., N, 


q( A (n) ) = j} GIG(g$\h,aM,bM), (26) 


r n — 1 


(27) 


where the posterior parameters can be updated by 

h=l[ln+U R x) - 1 ’ 

\ k^n J 

4”) = E[/3]E [vec (0 2 . r „...) T ] 0E [A« fc >] 

k^n 

+ E [uJfuW] . 

The computational complexity for inference of q( A^) is 
also 0{Un R n + I n R n)- Given the updated parameters, 
we can evaluate ^GiG^r^] straightforwardly, while an 
alternative approximation is the posterior mode w.r.t. GIG 
distribution that can avoid computational instabilities of 
modified Bessel function. 

By comparing |26| with (23) , we can investigate the es¬ 
sential difference between Student-t and Laplace priors. One 
can show that (231 can be rewritten as GIG (a$, 0^ 

with parameters given by ([24}. Hence, the key difference lies 
in the setting of 7 . If 7 = 0, Student-t and Laplace priors are 
essentially equivalent. To avoid manually tuning parame¬ 
ters, we also place a hyper-prior over 7 and thus derive 
the variational posterior distribution as q( 7 ) = Ga(a^-, &Xr) 
whose posterior parameters can be updated by 


N 


a M — a 0 + ^ n ’ 

n- 

1 


n= 1 

Ki = b o + 2 E E E ] • 

n= 1 r n = 1 


(28) 


(n)~ 


It should be noted that E[X[ 
straightforwardly by E[A^] 1 


] cannot be computed 
Since q(\$) is a GIG 


distribution as shown in ( [26} , it is not difficult to derive 
that q(\$ ) = G/G(—ft,a^), yielding the posterior 
expectation computed by 


IE gig j 



(29) 


and the posterior mode computed by 


arg max GIG 

^rT 1 



(-ft - 1) + \f (-ft - l) 2 + ai^biy 


(30) 


where Ki-h( m ) denotes a modified Bessel function of the 
second kind. 

As can be derived, the variational posterior distribu¬ 
tion over the noise hyperparameter is q(r) = Ga(a T M , b r M ) 
whose parameters can be updated by 



where the posterior expectation of model residuals can be 
evaluated by 


E 


vec(y) — ( 0 U ( "^vec(e) 


\\yf F - 2vec(y) T |^(g)E[U (rl) ]J E[vec(C)] 

+ Tr [vec(£)vec(£) T ] (g)E[u ( " )T U (n) ]j 


(32) 


In principle, Ejvec(£/)vec(£/) T ] = vec(Q)vec(Q) T + E^. 
However, vec(£/)vec(£/) T can be alternatively used as an 
approximation, which then makes it possible to apply 
multilinear operations for computing ( [32} quite efficiently. 
Hence, the computational complexity can then be reduced 
toG(nn^n+n n ^n). 

The inference framework presented in this section can 
essentially maximize the lower bound of model evidence 
which is defined by C(q) = E g (@) [\np(y, @)] + H(q(Q)). 
The first term denotes the posterior expectation of joint 
distribution while the second term denotes the entropy of 
g(@). In principle, L(q) should increase at each iteration, 
thus it can be used to test for convergence. We provide the 
detailed computation forms of L(q) in the Appendix. 


5 Bayesian Tucker Model for Tensor Com¬ 
pletion 

5.1 Model specification 

In this section, we consider Bayesian Tucker model for 
tensor completion. Let y denotes an incomplete tensor 
(i.e., with missing entries), and O denotes a binary tensor 
indicating the observation positions, i.e., Oi 1 ...i N = 1 if 
(hi... ^n) £ ^ otherwise it is zero. Ll denotes a set of TV¬ 
tuple indices of observed entries. denotes only observed 
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entries. Similar to BTD model, we assume a generative 
model y q = Xq + e where the latent tensor X can be rep¬ 
resented exactly by a Tucker model with a low multilinear 
rank and e denotes i.i.d. Gaussian noise. 

Given an incomplete tensor, Bayesian Tucker model only 
considers the observed entries, yielding a new likelihood 
function represented by 

p{yn |{U<">},a,r)- H V(y il ... iN |^ 1 ... iN ,r- 1 ). 

Based on Tucker decomposition framework, we can thus 
represent the observation model as that V(ii,..., i^), 

0 U in )T ) VeC (£)< T_1 

(33) 

For Tucker decomposition of an incomplete tensor, the 
problem is ill-conditioned and has infinite solutions. The 
low-rank assumption play an key role for successful tensor 
completion, which implies that the determination of multi¬ 
linear rank significantly affects the predictive performance. 
However, standard model selection strategies, such as cross- 
validation, cannot be applied for finding the optimal mul¬ 
tilinear rank because it varies dramatically with missing 
ratios. Therefore, the inference of multilinear rank is more 
challenging when missing values occur. 

As described in BTD model, we employ two types of 
hierarchical group sparsity priors over the factor matrices 
and core tensor with aim to seek the minimum multilinear 
rank automatically, which is more efficient and elegant than 
the standard model selections by repeating many times and 
selecting one optimum model. Therefore, the model priors 
for all hidden variables are same with that in BTD model. By 
combining likelihood model in {33} with the model priors 
in 0 we propose a Bayesian Tucker Completion (BTC) 
model, which enable us to infer the minimum multilinear 
rank as well as the noise level solely from partially observed 
data without requiring any tuning parameters. 

5.2 Model inference 

For BTC model, we also employ VB inference framework 
to learn the model under a fully Bayesian treatment. Since 
BTC model differs from BTD model in the likelihood func¬ 
tion |33| , indicating that the inference for factor matrices, 
core tensor and noise parameter are essentially different, 
while other hyperparameters can be inferred by the same 
solutions. In this section, we present only the main solu¬ 
tions while the detailed derivations are provided in the 
Appendix. 

As can be derived, the variational posterior distribution 
over the core tensor Q is 

q(Q) = M (vec(C/)|vec(d), S G ) (34) 

where the posterior parameters can be updated by 

vec(C?) = E[t] Tjg J2 k- w (g) E [uSl ) ] J (35) 

S G = | E[/3] (g)E [A (n) ] +E[r] ^ 0E [u£ ) u£ )T ] l 

( 36 ) 




It should be noted that Theorems |4.1| |4.2| and multilinear 
operations cannot be applied to ( [36} due to the sum of 
kronecker products. Thus the sequential kronecker products 
must be computed explicitly, resulting in the computational 
cost of O 0L Rn + M n n Rn)' where M denotes the num¬ 
ber of observed entries (i.e., data size), and memory cost 
of (D (Rn Rri) • This severely prevents the method from 
being applied to large-scale datasets. To improve scala¬ 
bility, we propose an alternative solution by optimizing 
argmmg{— In q{Q)} instead of closed-form update in (35) . 
This can be achieved by employing a nonlinear conjugate 
gradient method with the gradient given by 


+ E M E {((8> E [ u i! )u i! )T ]) vec (£)l 

-E[r] 5] (y il ... ijv (g)E[u^ ) ] N ) (37) 

Thus, multilinear operations can be applied without explic¬ 
itly computation of kronecker products, resulting in reduced 
computational complexity of 0{M\\ n R n ) and reduced 
memory cost of O (n n Rn), which scales linearly with data 
size and model complexity. 

As can be derived, the variational posterior distribution 
over is factorized as 

9(U W ) =n A/ '( u i" ) | 5 i" ) ’*i" ) )> n=l,...,N, (38) 


where the posterior parameters can be updated by 


2^ = E[t]»^E [G (n) ] 53 DV-i* 0 E [u«] 

y k^n 

(39) 

^3 = |E[A (n) ] + E[t]E [G (n) *|3Gf B) ] } _1 (40) 

where 4?[3 = 53 0) u T u T T ' summa tion is 


(ii,...,ijv)Gn k^n 

performed over the observed data locations whose mode-n 
index is fixed to i n . In other words, represents the 

statistical information of mod e-k (k ^ n) fatent factors that 
interact with . In ( [io} , the complex posterior expectation 
can be computed by 


vec {E [G (n) *|3Gf n) ] } = E [G (n) ® G (n) ] vec (^ } ) . (41) 

The computational complexity for inference of u- r ’^ is 
O {R^ + M n n Rn), while the memory cost is 0(]\ n R n ). 

An intuitive interpretation of |4Q| is that the posterior 
covariance combines prior information denoted by 

E[A^] and posterior information of interacted factors in 
other modes, while the tradeoff between these two terms is 
controlled by model residual E[r]. Hence, if updated E[A^] 
is quite large and model is not well fitted, then the posterior 
variance of r n th component will be easily forced to zero. 
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Given updated g(u^) from (381, we can easily compute 
the posterior expectations of some expressions as 


E[u ( fuW] = ufuW + £ (M 


(n)\ 

n / r n r n 


E[ui n) ui n)T l 


~(n)~(n)T T (n) 

U- U- + W • . 


(42) 


The variational posterior distribution over noise preci¬ 
sion r can be derived as q(r) = Ga(r \ a Mi &5f) whose 
parameters are updated by 


a M — a o + y Oi- 


b T M = bl + -E 


E 

(il ,...,ijv)CO 


y*, 


4" )T> ) vec (£)l 


(43) 


The posterior expectation of model residuals over observed 
entries can be evaluated by 


E 


y V ( yii-'-iN 


MT 


vec(Q) 


M 2 f-2 | £ ^n... ijv (g)E[u^ )T ] jE[vec(C)] 

+ Tr l E [vec(£)vec(e) T ] 


(«1 n 


T (ri) 

[ U <„ U - 


(44) 


The computational complexity is 0(M Y\ n R n ), if multilin¬ 
ear operations have been applied. 

Note that other hidden hyperparameters in 0 except 
{0,{U< n > },t} can be inferred essentially by the same so¬ 
lutions with BTD model. In addition, the lower bound 
C(q) = E g (©) [In p(yn, 0)] + H(q(&)) can be also evaluated 
with different expressions related to {£/, {U^ n ^},r} (see 
Appendix for details). 

The predictive distributions over missing entries, given 
observed entries, can be approximated by using variational 
posterior distributions q(Q) as follows 


pOV-iJTn) = J PW 1 ...i W |0)p(e|y n )de 

« V(3^ 1 ... ijv |y* 1 ...* jv , E[r] _1 +<7?.,. ijv ) , 

where the posterior parameters can be obtained by 

£) E K )T ]) E[vec(C/)] 


(45) 


3^1 •••ijV 

4„. ijv = Tr ( E [vec(^)vec(^) T ] (g)E [u^u^ )T ] 

\ n 

-E[vec(£)] T ((g) E [u£>] E [u^ )T ] J E[vec(0)]. 


(46) 


Therefore, our model can provide not only predictions over 
missing entries, but also the uncertainty of predictions, 
which is quite important for some specific applications. 


6 Algorithm Related Issues 

We denote Bayesian sparse Tucker models (i.e., BTD and 
BTC) when two different group sparsity priors have been 
employed by BTD-T, BTD-L, BTC-T, BTC-L respectively, 
where T, L represent a hierarchical Student-t and Laplace 
priors. The main procedure of model inference is described 
as follows: 

• Setting of top-level hyperparameters. For BTD- 

T and BTC-T models, {aj, > a o > > a o ? &o} are 

set to le-9. For BTD-L and BTC-L models, 
{aj, a o ? 5 a o? ^ol are set to l e_ 9- This setting 

results in a noninformative prior, which ensures that 
model inference is solely based on observed data. 

• Initialization of hidden parameters 0. For BTD-T 
and BTC-T, © = {Q, {U( n >}, {A (n) },r,/?}. For BTD- 
L and BTC-L, © = {Q, {U< n >}, {A (?l) }, r, /3, 7 }. Q 
does not need to be initialized due to that it will 
be updated firstly. The latent dimensions can be 
initialized by R n = 7 n , Vn or manually by R n < 1 n . 
{U<"> } can be initialized as left singular vectors by 
mode-n SVD, while an alternative one is sampling 
from A/"( 0 , 1 ). r is initialized by 1 /oy denoting an 
inverse of data variance. {A^}, /?, 7 are simply ini¬ 
tialized to be 1 . 

• Variational model inference. For all the models, the 
approximate inference can be updated sequentially 
in the order of {$, {U^ n ^},/3, {A^}, 7 ,t}. 

• Model reduction. Pruning out zero-valued latent 
components from {U^ n ^} as well as the associ¬ 
ated slices from Q and associated components from 
{A (n) }. 

• Lower bound of model evidence. The lower bound 
of log-marginal likelihood is evaluated according 
to specific models, which can be used to test for 
convergence. 

• Predictive distribution. For BTD models, the pre¬ 
dictive distribution can be inferred for recovering 
latent tensor without noises. For BTC modes, it can 
be inferred for recovering missing entries. 

For BTD-T and BTD-L models, the overall computational 
complexity is 0(^2 n R^ + N\[ n I n ), while the memory 
cost is 0(Y\ n R n + U n In)- F° r BTC-T and BTC-L mod¬ 
els, the overall computational complexity is 0(^2 n I n R^ + 
M \[ n R nY.n I n), while the memory cost is 0{\[ n R n + 
I n R n ). Therefore, BTD models are more computational 
efficient than BTC models, but it requires more memory. The 
computational complexity of all models scales linearly with 
data size, but polynomially with model complexity. Hence, 
our models are suitable for relatively low multilinear rank 
tensors. It should be noted that, because of automatic model 
reduction, {Ri, ..., R^} reduces rapidly in the first few 
iterations, resulting in that computational complexity will 
be decreased with number of iterations. 

The key advantage of our models is automatic model 
determination (i.e., learning multilinear rank), which enable 
us to obtain an optimal low-rank Tucker approximation 
from a noisy and incomplete tensor. Secondly, taking into 
account uncertainty information of all model parameters 
by full posterior inference, our method can effectively pre- 
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vent overfitting problem and provide predictive uncertainty 
as well. Thirdly, a deterministic Bayesian inference with 
closed-form update rules is derived for model learning, 
which is more efficient and scalable than sampling based 
inference. Finally, our methods are significantly convenient 
for practical applications since they do not require any 
tuning parameters. 


BTD-T BTD-L ARD-Tucker 



30 15 0 30 15 0 30 15 0 


SNR (dB) SNR (dB) SNR (dB) 

Fig. 2. Accuracy of inferred tensor rank under conditions of varying 
SNR and R. White area indicates successful estimations and black area 
indicates failed cases. 


7 Experimental Results 

We evaluated the proposed models and algorithms by ex¬ 
tensive experiments and comparisons with state-of-the-art 
related works which include HOOI (9), ARD-Tucker 1181, 
iHOOI (31), WTucker (20], HaLRTC [28). In all our exper¬ 
iments, the parameters of HOOI, WTucker, and HaLRTC 
were tuned carefully and the best possible performances 
were reported. ARD-Tucker and iHOOI can automatically 
adapt tensor rank, which is similar to BTD and BTC meth¬ 
ods proposed in this paper. 

7.1 Synthetic data 

The simulated data was generated according to an or¬ 
thogonal Tucker model. More specifically, the core tensor 
was drawn from Q ri r 2 r 3 ~ A/*(0, 1 ), Vr n = 1,..., R n . The 
factor matrices were drawn from ^ A7(0 ,1 ),Vi n = 

1,..., 50, Vr n = 1 ,... i? n , Vn = 1,2,3, and then orthogo- 
nalization was performed such that lj/ n ) T U( n ) = I r u . In 
principle, such data can be perfectly fitted by HOSVD or a 
more powerful method HOOI. 




(a) HOOI 


(b) BTD-T 



(C) BTD-L 



(d) ARD-Tucker 


Fig. 1. [(a)] shows performance baseline using tru e tensor r ank under 
varying noise levels (SNR) and tensor rank (R). |(b][c][d)| show dis¬ 
crepancy between baseline and three methods which can infer rank 
automatically. 


We evaluated BTD-T/L on complete tensors under vary¬ 
ing noise levels with SNR ranging from 30dB to -5dB 
and true tensor rank (iih, i? 2 , ^ 3 ) ranging from 5 to 25. 
The reconstruction performance was evaluated by RRSE 
= ||AT — where X denotes the estimation of 

true latent tensor X . Theoretically, the best performance 
should be obtained by HOOI given true tensor rank. As 
shown in Fig. [I] RRSE of HOOI increases exponentially with 


BTD-T BTD-L 



25 15 5 -5 25 15 5 -5 

SNR (dB) SNR (dB) 


Fig. 3. Absolute error of inferred SNR. The maximum error is close to 
IdB when R=(25, 25, 25) and SNR= -5dB. 


noise levels and larger R leads to a more rapid degener¬ 
ation of performance. It should be emphasized that BTD- 
T/L achieve surprisingly improved performances without 
knowing true tensor rank, and the improvements become 
more significant for larger R and smaller SNR. The RRSEs 
obtained by BTD-T and BTD-L are almost equal under all 
conditions. ARD-Tucker, which can also automatically infer 
tensor rank, however obtained significantly degenerated 
performance when SNR<10dB. The accuracy of rank de¬ 
termination is compared in Fig. [2j showing that BTD-T/L 
are able to exactly infer the tensor rank only except the 
cases of SNR=-5dB, R>=25 and outperform ARD-Tucker. 
In addition, BTD-T/L can infer noise levels accurately as 
shown in Fig. [3] The averaged runtime of BTD-T/L are 4/3 
seconds while ARD-Tucker needs 47 seconds. 

We evaluated BTC-T/L on incomplete tensors under 
varying missing ratios (MR) ranging from 50% to 90% and 
true rank {R n } ranging from 5 to 15, while SNR was fixed 
to 30dB. As shown in Fig. |4j BTC-T/L perform similarly un¬ 
der all conditions and significantly outperform iHOOI and 
HaLRTC. The RRSE increases with MR and larger R leads 
to a more rapid increase. iHOOI and HaLRTC, which are 
able to optimize tensor rank base on nuclear norm, perform 
well on relatively low rank tensors but degenerates rapidly 
with increasing rank. It should be emphasized that BTC- 
T/L exactly inferred tensor rank with 100% accuracy under 
all these conditions. As shown in Fig. [5j BTC-T/L can also 
accurately infer SNR when either true rank or MR is low, 
and the runtime depends on true rank. In summary, if both 
MR and true rank increase, the completion performance will 
decrease. 

7.2 Chemometrics data 

We evaluated BTD and BTC models on chemometrics data 
analysi^Jin terms of recovering the number of latent compo¬ 
nents which have meaningful interpretation and imputing 
missing data. Five tested data sets are Amino Acids Fluores¬ 
cence (5 x 201 x 61), Flow Injection (12 x 100 x 89), Eletronic 
Nose (18 x 241 x 12), Sugar Process (268 x 571 x 7), and 

1. Datasets are available from repository for multi-way data analysis, 
http:/ / www.models.kvl.dk/datasets 
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TABLE 4 

Global tensor completion for MRI data. 



50% 

80% 

PSNR 

RRSE 

PSNR 

RRSE 

BTC-T 

26.40 

0.12 

22.33 

0.19 

WTucker 

19.42 

0.27 

21.53 

0.21 

iHOOl 

21.57 

0.21 

19.77 

0.26 

HaLRTC 

23.12 

0.18 

17.06 

0.36 


0.1 
0.08 
lu 0.06, 

co 4 
cc 

DC 0.04 


0.02 


50 




60 70 80 90 

Missing ratio (%) 

(C) iHOOl 



Fig. 4. Tensor completion performances obtained by different methods 
under varying conditions of MR and true rank. The legend for all figures 
are same. 


SNR estimation 



50 60 70 80 90 

Missing ratio (%) 


Runtime (s) 



50 60 70 80 90 


Missing ratio (%) 


Fig. 5. Absolute error of inferred SNR whose maximum is 2dB when R = 
(15,15,15) and MR = 90%. The runtime is proportional to the true rank. 


Enzymatic Acivity (3 x 3 x 3 x 3 x 5). For Amino Acids 
Fluorescence data, each sample contains different amounts 
of tyrosine, tryptophan and phenylalanine dissolved in 
phosphate buffered water and measured by fluorescence 
on a PE LS50B spectrofluorometer. Although each sample 
is represented as a full rank matrix, leading to a full rank 
tensor. Ideally it should be describable by three latent com¬ 
ponents corresponding to individual amino acids. 

We tested BTD-T/L methods on the original data as 
well as noise corrupted data (SNR=0dB). Since the standard 
Tucker decomposition (HOOI) needs a predefined tensor 
rank, it can be specified as the inferred rank from BTD-T. As 
shown in Table [l] tensor rank inferred by BTD-T is (3, 3, 3), 
indicating that the underlying components associated to 
three amino acids are successfully captured. In addition, 
BTD-T yields a stable rank estimation even for highly noisy 
data. The performances of BTD-T/L are similar and even 
better than HOOI in most cases, although the same rank was 
employed. As compared to ARD-Tucker, BTD-T/L always 
obtain better performance and are more robust to noise. 
The computation efficiency of BTD-T/L is also significantly 
higher than ARD-Tucker. 

We also tested BTC-T/L methods by randomly selecting 
90% data for training, and the remainder was used as test 
case. We repeated this partition 10 times. Table [2] presents 
predictive performances and stability of inferred rank. Note 
that HaLRTC is sensitive to tuning parameters, and the best 
possible performances by searching tuning parameters in 
a wide range are reported. iHOOl is able to learn tensor 
rank automatically, however, the inferred rank is unsta¬ 
ble and sensitive to different partitions. In contrast, BTC- 


T/L achieved the best predictive performance on all five 
datasets, while the inferred rank is stable under any specific 
missing ratio. 

7.3 MRI data 

In this section, we evaluate BTC-T/L methods by MRI 
dataQ particularly for the generalization performance when 
original data does not posses a global low-rank structure. 
Since the dimensions and rank of MRI data are generally 
high, predictions of missing voxels from sparse observations 
and scalability of methods become more challenging. One 
simply assumption is that a tensor, which has not a globally 
low-rank structure, can be separated into smaller block 
tensors without overlapping and some block tensors may 
has a low-rank structure (locally low-rank). Hence, we can 
apply tensor completion to block tensors independently, and 
this strategy is also useful to handling a large-scale tensor. 
For HaLRTC, the tuning parameters were carefully selected 
to yield the best possible performances. BTC methods as 
well as iHOOl can automatically adapt tensor rank to data. 
These methods were applied to MRI data completion with 
varying missing ratios and noise conditions. As shown in 
Table [3j BTC-T/L achieve the best performance under all 
different conditions. When missing ratio is low (e.g., 50%), 
HaLRTC outperforms iHOOl, while iHOOl outperforms 
HaLRTC when missing ratio is larger than 60%. Since high 
quality of MRI images is required for medical diagnosis, 
higher missing ratios (e.g. > 90%) are not tested. In addition, 
we also applied these methods to MRI data globally for 
comparisons. WTucker was also used for evaluations, while 
its tensor rank was specified as the inferred rank by BTC 
method. As shown in Table |4j BTC always outperforms 
the other methods under different missing ratios. Note 
that WTucker can possibly obtain better performance than 
iHOOl and HaLRTC when tensor rank was specified by 
inferred rank from BTC under 80% missing ratio. From 
Table 3j [4j we can observe that when missing ratio is large, 
global tensor completion yields better performance than 
local tensor completion strategy. However, global tensor 
completions are limited by data size. The visual quality is 
shown in Fig. [6] which was obtained by BTC-T method. 

8 Conclusion 

We have proposed a class of Bayesian Tucker models for 
both tensor decomposition and completion. To modeling 
structural sparsity, we introduce two group sparsity induc¬ 
ing priors by the hierarchical representations. The model in¬ 
ferences, especially for the non-conjugate Laplace priors, are 

2. Dataset is available from http://brainweb.bic.mni.mcgill.ca/ 
brainweb/ 
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TABLE 1 

Chemometrics data analysis by tensor decompositions. The SNR of noisy data is OdB. R denotes inferred tensor rank. N/A 

indicates an inapplicable case. Runtime is measured by seconds. 




Amino 

Flow 

Electronic 

Sugar 

Enzymatic 



Original 

Noisy 

Original 

Noisy 

Original 

Noisy 

Original 

Noisy 

Original 

Noisy 

HOOI 

RRSE 

0.0245 

0.0960 

0.0321 

0.0827 

0.0328 

0.0664 

0.0386 

0.0839 

0.1677 

0.2174 


R 

(4, 3, 3) 

(6, 3, 3) 

(5,4,3) 

(12, 48, 49) 

a, i, i) 

(3, 2, 3) 

(5,8,7) 

(5,8,7) 

N/A 

N/A 

ARD-Tucker 

RRSE 

0.0260 

0.0961 

0.0476 

0.4377 

0.0328 

0.1006 

0.0387 

0.0788 

N/A 

N/A 


Runtime 

32 

38 

29 

275 

22 

49 

138 

145 

N/A 

N/A 

BTD-T 

R 

(3, 3, 3) 

(3, 3, 3) 

(3, 5, 3) 

(3,4, 3) 

(h 2,1) 

(1, 46,1) 

(6, 24, 6) 

(5,11, 6) 

(1,1,14,1) 

(1,1444) 

RRSE 

0.0245 

0.0950 

0.0321 

0.0812 

0.0328 

0.0664 

0.0387 

0.0788 

0.1677 

0.2216 


Runtime 

2 

2 

2 

2 

2 

5 

15 

15 

3 

8 

BTD-L 

R 

(3, 6, 3) 

(3, 3, 3) 

(3, 5, 3) 

(3,4, 3) 

(h b 1) 

(1, 24,1) 

(7, 25, 6) 

(5,11, 6) 

(1,1,1,14) 

(144,1,2) 

RRSE 

0.0222 

0.0950 

0.0321 

0.0812 

0.0328 

0.0664 

0.0353 

0.0789 

0.1677 

0.2209 


Runtime 

5 

4 

2 

4 

2 

5 

15 

15 

8 

7 


TABLE 2 

Chemometrics data completion with 90% missing ratio. Std(R) denotes standard deviation of inferred tensor rank. 



Amino 

Flow 

Electronic 

Sugar 

Enzymatic 


Std(R) 

RRSE 

Std(R) 

RRSE 

Std(R) 

RRSE 

Std(R) 

RRSE 

Std(R) 

RRSE 

HaLRTC 

N/A 

0.35±0.01 

N/A 

0.12± 0.02 

N/A 

0.06±0.01 

N/A 

0.18±0.00 

N/A 

0.68±0.04 

iHOOI 

(0,51,17) 

0.64±0.08 

(0.3,0.3,0.3) 

0.44±0.32 

(5,38,2) 

0.04±0.01 

(0,2,0) 

0.64±0.00 

(0,0,0,0.3,0.5) 

0.89±0.49 

BTC-T 

(0.4,0.4,0) 

0.03±0.00 

(0.5,0.6,0.4) 

0.10±0.04 

(0,0,0) 

0.03±0 

(24-5,0) 

0.07±0.01 

(0,0,0,0,0) 

0.21±0.03 

BTC-L 

(0.5,0.4,0) 

0.026±0.00 

(Q.5,0.5,0.3) 

O.lldzO.04 

(0,0,0) 

0.04 ±0 

(1.4,0.7,0) 

0.07±0.01 

(0.4,0,0,0,0) 

0.21±0.03 


TABLE 3 

The performance of MRI completion evaluated by PSNR and RRSE. For noisy MRI, the standard derivation of Gaussian noise is 
3% of brightest tissue. MRI tensor is of size 181 x 217 x 165 and each block tensor is of size 50 x 50 x 10. 



! 50% 

60% 

70% 

80% 


Original 

Noisy 

Original 

Noisy 

Original 

Noisy I 

Original 

Noisy 

BTC-T 

27.32 

0.11 

26.18 

0.12 

25.30 

0.14 

24.60 

0.15 

22.81 

0.18 

22.35 

0.19 

20.14 

0.25 

20.00 

0.25 

BTC-L 

26.91 

0.11 

25.57 

0.13 

24.84 

0.15 

23.95 

0.16 

22.76 

0.19 

22.09 

0.20 

20.12 

0.25 

19.80 

0.26 

iHOOI 

22.69 

0.19 

21.45 

0.22 

22.47 

0.19 

21.16 

0.22 

21.63 

0.21 

20.11 

0.25 

18.65 

0.30 

17.89 

0.32 

HaLRTC 

24.84 

0.15 

23.60 

0.17 

22.35 

0.19 

21.65 

0.21 

19.93 

0.26 

19.55 

0.27 

17.37 

0.34 

17.15 

0.35 


SNR=20dB, MR = 50%, PSNR= 26dB SNR=20dB, MR = 80%, PSNR= 22dB 



(a) 50% missing (b) 80% missing 


Fig. 6. Visualization of MRI data completion obtained by BTC-T. 


derived under variational Bayesian framework. Our models 
can infer an optimal multilinear rank from whether a fully or 
partially observed tensor by automatical model reduction, 
yielding significant advantages for tensor completion. For 
algorithm implementation, we propose several Theorems 
related to multilinear operations to improve computational 
efficiency and scalability. Empirical results on synthetic 
data as well as chemometrics and neuroscience applications 
validated the superiority of our models in terms of tensor 
decomposition and completion. 
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